Document: FSC-0044
Version: 001
Date: 04/01/90
An improved method of duplicate message detection and prevention
Jack Decker
1:154/8@Fidonet
Status of this document:
This FSC suggests a proposed protocol for the FidoNet(r) community,
and requests discussion and suggestions for improvements.
Distribution of this document is subject to the restrictions stated
in the copyright paragraph below.
Fido and FidoNet are registered marks of Tom Jennings and Fido
Software.
Purpose:
The purpose of this document is to present a proposal for an improved method of
duplicate message detection and prevention, that could eventually replace the
existing PATH and SEEN-BY lines currently used within Fidonet. The principal
advantages of this method over previous schemes are that it is fully Zone-aware
and Point-aware, and that it adds far fewer bytes to a message than the current
combination of SEEN-BY and PATH lines. It can also be run in parallel with
existing SEEN-BY and PATH lines for an indefinite period, thus allowing a
"transition period" of as long as is necessary for software to be converted.
Copyright:
This document is Copyright 1990 by Jack Decker. However, permission is granted
for any and all non-commercial use, providing the contents of this document are
not altered in any way. Also, permission is expressly granted for any use by
developers of software primarily intended to be used in the Fidonet amateur
communications network, or in any similar amateur communications network that
primarily uses Fidonet technology and protocols, whether that software is
commercial or not.
Comments on this proposal, and suggestions for improvement are welcomed by the
author. In particular, suggestions on how this proposal might be reworded to
make the meaning clearer are especially welcome.
A. Definition:
In this document the characters ^A (caret and capital A) will be used to
represent a 0x01 (SOH) byte. It will be most commonly used in reference to the
"^APTH line", which will be a line that begins with a 0x01 byte immediately
followed by the letters "PTH" (and then by additional data as specified below).
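As an illustration of this definition, here is a minimal Python sketch that recognizes a ^APTH line in raw message bytes. The function name pth_payload and the choice to return the address data as an ASCII string are assumptions made for this example only; they are not part of the proposal.

```python
from typing import Optional

SOH = b"\x01"  # the byte this document writes as ^A


def pth_payload(line: bytes) -> Optional[str]:
    """Return the address data of a ^APTH line, or None for any other line."""
    prefix = SOH + b"PTH "              # keyword plus the single separating space
    stripped = line.rstrip(b"\r\n")     # ignore the line terminator, if any
    if not stripped.startswith(prefix):
        return None
    # Assumption: address data is plain ASCII; other bytes raise UnicodeDecodeError.
    return stripped[len(prefix):].decode("ascii")
```

Note that the literal two-character text "^A" in a message is not a match; only a real 0x01 byte is.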
B. Why a new method of duplicate message detection is needed:
Most of the methods of duplicate message detection currently used in Fidonet
echomail processing operate by trying to find some distinguishing
characteristic of an echomail message (whether it be something deliberately
included in the message, such as an EID, MSGID, etc. type of "kludge line", or
something which is contained in all echomail messages, such as the message
header). Typically, either the item being used for duplicate detection itself
or a checksum of that item is then saved in a data file, and if another item
comes in with that same distinguishing characteristic, the message is
considered to be a duplicate message. The data files used to store
previously-seen message data can occupy a significant amount of disk space if
many conferences are carried on a system.
Unfortunately, all such schemes seem to suffer from the drawback that under the
proper circumstances, messages that are not duplicates of each other may be
created with identifying characteristics that are similar enough to be falsely
recognized as duplicates. The circumstances under which this can happen may
differ from method to method, but none are totally foolproof. Thus, it's
possible that messages may be deleted as duplicates even though in reality they
are not duplicates, but rather they are simply messages that contain data that
make them appear to be duplicates.
The most common cause of duplicate messages is improper echomail topology (also
known as the infamous "dupe loop"). While there are certainly other ways that
duplicates can be generated, improper topology is far and away the leading
cause. Further, many of the current duplicate elimination schemes will NOT
catch most of the duplicates that are NOT generated as a result of improper
topology (which is why duplicate messages are seen occasionally, even though
duplicate message detection schemes are currently in use).
Unfortunately, if a duplicate killer is to be effective, it must store the
identifying characteristics for the last several thousand messages that have
passed through a particular system. This not only uses up disk storage space,
it consumes extra processing time during echomail processing, since each new
arriving message must be compared to the data list in the attempt to determine
if it is, indeed, a duplicate.
A better approach would be to store within a message itself data that
identifies it as having already been received by a particular system, before
sending it on to another system. Then, if the same message came back to a
given system in a "dupe loop", it would be possible to positively identify it
as one that has already been seen on that particular system. And, since the
data necessary to identify the message as a duplicate is stored within the
message itself, it is possible to use this method without the necessity of
storing great amounts of data on previously-seen messages (in many
implementations this alone would save 10K or more of disk space per conference
carried!).
Were it not for the fact that the PATH line present in most echomail messages
does not contain Zone or Point information, we could use it for that purpose.
However, since it does not contain that information, it cannot and should not
be used in that manner. Another drawback of the PATH line is that because it
is physically located at the end of a message (after the SEEN-BY lines), if
only the last part of a message is scrambled or deleted, the PATH line
information will be lost.
C. Proposal:
1) A new type of kludge line (commonly known within FIDONET as an "IFNA kludge
line"), which combines certain characteristics of the existing PATH and SEEN-BY
lines, will be placed into each echomail message. This line, known as the
^APTH line, will be placed at the TOP of the message (not necessarily the first
line, but prior to the body of the message text). IMPORTANT: Support for the
existing PATH and SEEN-BY lines will be retained as long as is necessary to
accommodate everyone in Fidonet, but eventually the ^APTH line could possibly
replace both the current PATH and SEEN-BY lines.
2) The ^APTH line will contain full four-dimensional addressing
(Zone:Net/Node.Point), BUT elements that are the same as the previous entry in
the line need not be repeated. When the "point" portion of an address stands
alone, it shall be preceded by at least a "." character (to distinguish it from
a node address).
3) If a system is importing messages and finds a message with its own address
already in the ^APTH line, it will discard the message (unless that address is
in the very last position on the line... this allows for the odd situation
where a point or another task on the same system has already inserted the
system's address in the ^APTH line, or where it is desirable to process the
same message a second time).
4) One (and only one) non-numeric character may appear just AFTER any address
on the ^APTH line. When using the ^APTH for duplicate message checking only,
you may just skip past any such address, unless it's your own address (see
examples later in this document). In that case, strip the address and the
non-numeric character (in other words, if you see your own address but it's
immediately followed by a non-numeric character, remove that address, add yours
to the end of the ^APTH line, and toss the message anyway). The reason for
doing this is to allow the design of an echomail processor that doesn't rely on
SEEN-BY's. Such a processor could append a non-numeric character (such as a
"!") to an address, in order to indicate that "this message hasn't really
passed through this node, but don't send it back there" (which would be the
equivalent of a SEEN-BY statement for that node, indicating that this message
has already been sent to that node). Thus the ^APTH line could eventually take
the place of SEEN-BY lines, but you could still have a "fully coupled"
triangular or rectangular topology. In this case, you'd add the nodes that are
part of that fully coupled topology to the ^APTH line BEFORE sending the
message to them, but with the special character after the address. The
receiver would know that the message didn't really pass through that node yet,
but it should NOT send it to that node under any circumstances.
(Please note that during the initial design of software to create ^APTH lines,
you would not have to worry about generating the special case with the trailing
non-numeric characters; you'd just have to be able to handle them as shown in
the examples below if you came across one).
D. Specifications and examples:
The general specifications for a ^APTH line, and a general outline of how an
incoming message might be processed, follow.
A valid ^APTH line will contain at a minimum the string ^APTH followed by a
single space character and the network address of the system that created the
^APTH line, in Zone:Net/Node[.Point] format, where ^A is a 0x01 byte (SOH) and
the point address is required only if the system is a point (specifically, a
system that is NOT a point should not necessarily use .0).
Once again, the FIRST address specified in a ^APTH line is expected to contain,
at a minimum, Zone, Net, and Node numbers. If any of these are missing from
the FIRST address, the line should be considered defective. It will be left to
the discretion of the software author as to how to handle a message with a
defective ^APTH line.
Subsequent addresses in the ^APTH line are delimited by spaces and should
contain only that information that is different from the previous entry on the
line, except that when a bossnode receives a message from a point, then the
bossnode should append its node number only. Specific examples follow:
a. If the Zone is the same as the previous address, but the net is
different, then only Net/Node[.Point] should be used.
b. If the Zone and Net are the same as the previous address, but the
node is different, then only Node[.Point] should be used.
c. If the Zone, Net, and Node are the same as the previous address,
but the point is different, then only .Point should be used. Note
that in this case, the period is included.
d. If the Zone, Net, and Node are the same as the previous address, but
the previous address contains a point specifier and the receiving
system is not a point (i.e., it IS the bossnode), then only Node
should be used. .0 (point zero) might also be a valid entry in this
case, but only if the bossnode consistently identifies itself to other
systems using a full four-dimensional address. For example, a message
that originated on 1:234/5.6 and went from there to 1:234/5 would
contain a ^APTH line in this format:
^APTH 1:234/5.6 5
If the bossnode is also considered to be point zero, then
^APTH 1:234/5.6 .0
Would be equally valid.
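The abbreviation rules a. through d. above can be sketched in Python as a routine that expands the address portion of a ^APTH line back into full addresses. This is only one possible reading of the rules: the names Entry, TOKEN and expand_pth are invented for this example, and raising ValueError for a defective line is an assumption, since the proposal leaves the handling of defective lines to the software author.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    zone: int
    net: int
    node: int
    point: Optional[int]  # None when the entry carries no point specifier
    flag: str             # trailing non-numeric character such as "!", else ""


# Token layout: [[Zone:]Net/][Node][.Point][flag]
TOKEN = re.compile(
    r"(?:(?:(?P<zone>\d+):)?(?P<net>\d+)/)?"
    r"(?P<node>\d+)?(?:\.(?P<point>\d+))?(?P<flag>[^\d\s])?"
)


def expand_pth(addresses: str) -> List[Entry]:
    """Expand the text after '^APTH ' into a list of full addresses."""
    entries: List[Entry] = []
    zone = net = node = None  # running defaults; points are never defaults
    for token in addresses.split():
        m = TOKEN.fullmatch(token)
        # A token needs a node number, unless it is a bare ".Point".
        if m is None or (m["node"] is None
                         and (m["net"] is not None or m["point"] is None)):
            raise ValueError(f"malformed address {token!r}")
        if m["zone"] is not None:
            zone = int(m["zone"])
        if m["net"] is not None:
            net = int(m["net"])
        if m["node"] is not None:
            node = int(m["node"])
        if zone is None or net is None or node is None:
            raise ValueError("first address must give Zone, Net and Node")
        point = int(m["point"]) if m["point"] is not None else None
        entries.append(Entry(zone, net, node, point, m["flag"] or ""))
    return entries
```

For instance, expand_pth("1:234/5.6 5") yields 1:234/5.6 followed by 1:234/5 with no point, while expand_pth("1:234/5.6 .0") yields 1:234/5.6 followed by 1:234/5.0.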
In the case of a "fully connected" topology, nodes may be added to the ^APTH
line even though a message has not actually passed through those nodes, to
prevent the message from being sent to those nodes. Such nodes should have an
exclamation point character ("!") appended to the end of the entry, immediately
following the node or point number. These nodes should be added to the very
end of a new or existing ^APTH line, after the address of the node which added
them.
For example, suppose that 157/200, 154/9, and 228/6 were in a "fully connected"
topology. When a message was received by 157/200 and then sent to 154/9 and
228/6, the ^APTH line might look something like this:
^APTH 3:711/431.5 431 430 403 1:124/4210 4115 157/200 154/9! 228/6!
When a message arrives on one of the nodes indicated by the exclamation point,
the exclamation point entry should be removed, and the node should add itself
to the end of the line in the normal manner. For example, after the message
containing the above ^APTH line was received at 154/9, it would be modified to
read:
^APTH 3:711/431.5 431 430 403 1:124/4210 4115 157/200 228/6! 154/9
Please note that at the time of this proposal, the exclamation point (!) is the
ONLY "officially recognized" non-numeric character that can be expected to be
appended to a ^APTH line address; however, the possibility remains that someone
may figure out a good reason to use a different trailing character for some
other (but similar) purpose, so I am allowing for that possibility by using the
generic terminology "non-numeric character" rather than the more specific
"exclamation point" throughout this document.
The ^APTH line must be terminated with a carriage return and/or linefeed (a
carriage return followed by a linefeed is preferred, and should be used by all
systems capable of generating a carriage return/linefeed combination).
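A minimal Python sketch of creating a fresh ^APTH line for a message that does not yet have one, using the preferred carriage return/linefeed terminator. The function name new_pth_line and its parameters are hypothetical and chosen only for this example.

```python
from typing import Optional


def new_pth_line(zone: int, net: int, node: int,
                 point: Optional[int] = None) -> bytes:
    """Build a complete ^APTH line holding only the creating system's address."""
    address = f"{zone}:{net}/{node}"
    if point is not None:            # only points add a point specifier
        address += f".{point}"
    # 0x01 (SOH), the keyword PTH, one space, the address, then CR/LF.
    return b"\x01PTH " + address.encode("ascii") + b"\r\n"
```

A node such as 1:154/8 would call new_pth_line(1, 154, 8), while the point 1:234/5.6 would call new_pth_line(1, 234, 5, 6).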
There is no specified limit on the length of a ^APTH line. Each message should
contain only one ^APTH line, even if it extends beyond the typical 80 column
screen width. The ^APTH line is primarily intended for use by the conference
mail processing software, so primary consideration is being given to ease of
processing the line, rather than making it easily human-readable (most software
will not display kludge lines hidden behind a ^A character in any event).
E. Pseudo-outline of message processing
Here is a suggested flow pseudo-outline showing one way that messages might be
processed in a standalone program that runs between the import and export
cycles of an existing conference mail processor such as ConfMail (this outline
assumes that the standard Fido/Opus style *.msg files are used, though
obviously that need not be the case):
1. Open *.msg file for input
2. Open temporary file for output
3. Copy header (first 190 bytes) from input to output file. The following
operations begin immediately following this header.
4. Examine each line of input file (a line is delimited by a carriage return,
linefeed, or any combination thereof) for one of the following:
a. A blank line - Write to output and examine next line.
b. A line containing spaces only - Write to output and examine next line.
c. A line that begins with a 01 byte (SOH) - GoTo 5.
d. A line that does not meet any of the above specifications.
I. Create a line containing the string ^APTH followed by a single
space character and your network address, in Zone:Net/Node[.Point]
format, where ^A is a 0x01 byte (SOH) and the point address is
required only if you are a point (specifically, a system that is
NOT a point should not necessarily use .0). This line should be
terminated with a carriage return and/or linefeed (a carriage
return followed by a linefeed is preferred).
II. Write the line created in 4.d.I. to the output file.
III. Write the line input in 4. to the output file.
IV. Goto 9.
5. If a line begins with a 0x01 (SOH) byte, examine the keyword immediately
following it.
a. If the keyword is NOT "PTH", write the entire line to output and
examine the next line (go back to 4).
6. If a line begins with ^APTH, examine each address in the line in turn.
Addresses start immediately following the characters "PTH " (note the
space).
a. The FIRST address is expected to contain, at a minimum, Zone, Net,
and Node numbers. If any of these are missing from the FIRST address,
the line should be considered defective. It will be left to the
discretion of the software author as to how to handle a message with a
defective ^APTH line.
b. As each address is found, any Zone, Net, and Node numbers found should
be stored in temporary variables, to be used as defaults for subsequent
addresses when only a partial address is given. For example, the first
address will contain a Zone number. This should be stored in a
temporary variable and used as the default Zone for all subsequent
addresses, until and unless another Zone number is seen in the line, at
which time that Zone number should be stored in the temporary variable
and used as the default Zone.
7. As each address is found, it should be compared against the system's
address. If a match is found:
a. Check to make sure that the address is not a point address if the
system's address does not contain a point specifier. If the system's
address is given without a point specifier, then it should not be
considered a match if ANY point address is found in the ^APTH line
address that is being compared (not even .0 - for example, if the
address 1:234/5.0 is seen in the ^APTH line, and 1:234/5 is the given
system address, then it is NOT a match).
b. If the address does match exactly, check to see if a non-numeric
character (specifically the "!" character) immediately follows the
address. If it does, then that address must be removed from the
line at that point.
I. When removing an address, please make sure that you do not change
the address of subsequent entries. This may require modification
of the next entry on the line, if one exists. For example,
suppose you had a "fully connected" topology where 1:157/200 sent
an echo to both 1:154/9 and 1:154/970. The ^APTH line might end
as follows:
..... 157/200 154/9! 970!
However, after modification of the ^APTH line, it should read:
..... 157/200 154/970! 9
You can see that if 154/9 were simply deleted without regard to
what follows on the line, the following (incorrect) line might
result:
..... 157/200 970! 154/9 (THIS IS INCORRECT)
The above is incorrect because 154/970 has been transformed into
157/970.
II. After removing an address followed by a non-numeric character,
continue to scan any remaining addresses in the ^APTH line in
case a match is found later in the line. If no other matches
are found, proceed as if no match had been found. Goto 8.
c. Check to see if the address is the last one on the line (not counting
addresses that have a non-numeric character immediately following
them). If this address is followed only by the end of the line, or
ONLY by addresses that have a non-numeric trailing character, then
there is a very high probability that we have either inadvertently
or deliberately processed this message twice, and it is not really a
duplicate. In this case, the original *.msg file should be left
untouched.
I. Close both the input and output files.
II. Delete the temporary output file. END.
d. If a match is found, and it is not followed by a non-numeric
character, and it is not the last address on the ^APTH line, then the
message is a duplicate message and should be treated as such (either
by deleting it, or moving it to a "bad messages" area or the netmail
area).
I. Close both the input and output files.
II. Delete the temporary output file.
III. Either delete or move the original .msg input file, as
appropriate. END.
8. If the end of the ^APTH line is reached and a match has not been found,
then add the system's address to the end of the ^APTH line. Then write
the modified ^APTH line to the output file.
I. If one or more addresses with an appended non-numeric character
(used within "fully-coupled" topologies) are to be added to the
^APTH line, they should be added at the very end of the line,
after the address of the system currently processing the
message.
9. Copy the remainder of the input file to the output file. Close both files.
10. Delete the input file.
11. Rename the temporary output filename to the old input filename. END.
[End of outline]
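The decision made in steps 7 and 8 of the outline can be sketched in Python as follows. The sketch assumes the ^APTH line has already been expanded into full addresses, each a tuple of zone, net, node, point (None when absent) and trailing flag character (empty when absent). The name check_path and the three result strings are invented for this example, and letting the first unflagged match decide the outcome is one reading of the outline rather than a requirement of it.

```python
from typing import List, Optional, Tuple

Address = Tuple[int, int, int, Optional[int]]      # zone, net, node, point
Entry = Tuple[int, int, int, Optional[int], str]   # address plus flag ("" or "!")


def check_path(entries: List[Entry], me: Address) -> Tuple[str, List[Entry]]:
    """Classify a message and return the (possibly updated) address list."""
    kept: List[Entry] = []
    for i, entry in enumerate(entries):
        address, flag = entry[:4], entry[4]
        # Step 7.a: the comparison is exact, so 1:234/5.0 does not match 1:234/5.
        if address != me:
            kept.append(entry)
            continue
        if flag:
            continue  # step 7.b: drop our own flagged entry and keep scanning
        # Steps 7.c and 7.d: an unflagged match; is anything unflagged after it?
        if all(later[4] for later in entries[i + 1:]):
            return "already-processed", entries   # leave the message untouched
        return "duplicate", entries                # delete or move the message
    kept.append(me + ("",))                        # step 8: add our own address
    return "new", kept
```

With the list 1:157/200, 1:154/9!, 1:154/970! and a system address of 1:154/9, the result is "new" and the updated list reads 1:157/200, 1:154/970!, 1:154/9.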
F. Additional notes and clarifications:
Note 1: In section 7.b.I. I mentioned the necessity of not simply deleting a
node from the ^APTH line without checking to see if the next address in the
^APTH line needs to be modified. This can easily be accomplished if TWO sets
of temporary variables are kept, for the CURRENT and PREVIOUS Zone, Net, and
Node values (Point addresses are NOT kept as defaults, thus there is no need to
store Point information). When reading the FIRST address in the ^APTH line,
the Zone, Net, and Node numbers of that address would be stored in both the
CURRENT and PREVIOUS variables. Thereafter, whenever a new Zone, Net, or Node
number is explicitly specified in a ^APTH line address, the new value(s) are
stored in the CURRENT variables, but first the CURRENT values are moved to the
PREVIOUS values.
To help visualize this, let's again suppose we have a ^APTH line that ends as
follows (all of these addresses are in Zone 1):
..... 157/200 154/9! 970!
Let's suppose that we are processing this message on 154/9, and will need to
remove the 154/9! address. When we encounter 157/200, our variables will be
set as follows:
        Previous | Current
Zone           1 | 1
Net            ? | 157
Node           ? | 200
Now, when we read 154/9, our current values will be moved to the previous:
        Previous | Current
Zone           1 | 1
Net          157 | 154
Node         200 | 9
We now have the data we need to determine what needs to be added to the next
address, after we delete 154/9. In this case, we need only compare the
Previous and Current addresses to determine which are UNEQUAL. In this case,
the Zone is the same, but the Net and Node are not. So, if the following
address lacks either the Net or Node, we'll have to add those. Now we delete
the 154/9! and look at the next address, 970. At this point our variables will
look like this:
        Previous | Current
Zone           1 | 1
Net          154 | 154
Node           9 | 970
Again, we compare to see which addresses are UNEQUAL. In this case, only the
NODE address is. So we know we do NOT have to add the NODE address, nor do we
have to add the Zone address (because it was not different on the first
compare). We only need add those address components which were unequal on the
first compare, but equal on the second compare. So, in this case, the Net
address must be added to the next address in the ^APTH line, leaving as a
result:
..... 157/200 154/970!
The current system address is then added back in at the end of the line, thus:
..... 157/200 154/970! 9
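The same result can also be reached by a different technique from the PREVIOUS/CURRENT bookkeeping described in this note: expand every entry to a full address, remove the flagged entry, and then re-abbreviate the whole line. The Python sketch below takes that alternative route. The names abbreviate and arrive are invented for this example, and entries are assumed to be tuples of zone, net, node, point (None when absent) and flag. One consequence of this approach is that any entry originally written more fully than necessary comes out shortened.

```python
from typing import List, Optional, Tuple

Address = Tuple[int, int, int, Optional[int]]      # zone, net, node, point
Entry = Tuple[int, int, int, Optional[int], str]   # address plus flag ("" or "!")


def abbreviate(entries: List[Entry]) -> str:
    """Write full addresses back out, omitting parts equal to the prior entry."""
    tokens = []
    prev = None  # (zone, net, node) of the preceding entry; points never carry over
    for zone, net, node, point, flag in entries:
        if prev is None or zone != prev[0]:
            text = f"{zone}:{net}/{node}"
        elif net != prev[1]:
            text = f"{net}/{node}"
        elif node != prev[2] or point is None:
            text = f"{node}"          # includes the bossnode-after-point case
        else:
            text = ""                 # same node: only ".Point" is written
        if point is not None:
            text += f".{point}"
        tokens.append(text + flag)
        prev = (zone, net, node)
    return " ".join(tokens)


def arrive(entries: List[Entry], me: Address) -> str:
    """Drop our own flagged entries, append our address, and re-abbreviate."""
    kept = [e for e in entries if not (e[:4] == me and e[4])]
    kept.append(me + ("",))
    return abbreviate(kept)
```

Because the line is rebuilt from full addresses, the entry for 154/970 automatically regains its Net number once 154/9! no longer precedes it.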
Note 2: In section 4.d it is suggested that, when a line that is neither blank
nor a kludge line (that begins with a ^A character) is found, a ^APTH line be
added at that point. However, there are reports that under certain
circumstances (particularly when messages are "forwarded" or "hurled"), certain
software may insert a non-kludge line prior to previously-existing kludge lines
in a message. It should be recognized by software authors that a non-kludge
line should NEVER be inserted in front of existing kludge lines located at the
start of a message, if those kludge lines are still valid (and if they are NOT
still valid, they should be removed). When a message is forwarded or hurled, it
is probably desirable to remove duplicate control information since what is
essentially happening is that the text of a previous message is being inserted
into a NEW message. Since the message is new, the "old" duplicate control
information is no longer valid.
Software authors that are implementing the ^APTH line in their software should
never search beyond the first text line of a message for the ^APTH line,
because if one is found later in the text, it is in all probability an old
^APTH line that was inadvertently copied over from another message, and is not
relevant to the current message.
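A minimal Python sketch of this search rule: scan only the leading blank and kludge lines, and give up at the first line of ordinary text. The function name find_pth_index is hypothetical, and the message body is assumed to have been split into text lines with the terminators already removed.

```python
from typing import List, Optional


def find_pth_index(lines: List[str]) -> Optional[int]:
    """Return the index of the ^APTH line, or None if the text starts first."""
    for i, line in enumerate(lines):
        if line.strip(" ") == "":
            continue                  # blank or spaces-only line: keep looking
        if not line.startswith("\x01"):
            return None               # first text line reached: stop searching
        if line.startswith("\x01PTH "):
            return i
    return None
```

A ^APTH line that appears after the first text line is deliberately not reported, since it was most likely copied over from another message.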
Note 3: This is an optional suggestion, for use during the transition period
in which the ^APTH line has to coexist with more traditional PATH and SEEN-BY
lines. If ^APTH line checking is being used during the import phase of
echomail processing in a conference mail processor, it might be a good idea to
optionally check to make sure that all ^APTH line addresses that are in the
system's home Zone (including those with an appended non-numeric character)
have been properly included in the SEEN-BY lines, and to add any that have not
been so included. It should be obvious that ^APTH line addresses that are NOT
in the system's home Zone should NOT be added to the SEEN-BY lines. If this
feature is implemented, it may be a good idea to give the sysop the ability to
enable or disable it by means of a command line switch or a configuration file
option.
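The optional check in this note might be sketched in Python as below. The sketch assumes the ^APTH line has already been expanded into tuples of zone, net, node, point and flag, and that the SEEN-BY lines have been parsed into a set of (net, node) pairs. Mapping a point entry to its bossnode's net/node pair is an assumption of this example, made because SEEN-BY entries carry no point information. The function name missing_seen_bys is hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

Entry = Tuple[int, int, int, Optional[int], str]   # zone, net, node, point, flag


def missing_seen_bys(entries: Iterable[Entry], home_zone: int,
                     seen_by: Set[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """List home-Zone ^APTH addresses that the SEEN-BY lines do not yet hold."""
    # Flagged entries are included too; entries from other Zones are ignored.
    in_zone = {(net, node)
               for zone, net, node, _point, _flag in entries
               if zone == home_zone}
    return sorted(in_zone - seen_by)
```

The caller would add the returned pairs to the SEEN-BY lines only when the sysop has enabled the feature.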
Note 4: If nodes with trailing non-numeric characters are inserted into a
^APTH line for the purpose of indicating "SEEN-BY" nodes in a fully coupled
topology, it is permissible (but not required) to include those nodes ONLY in
the ^APTH lines of messages actually exported to the nodes participating in the
circular topology. In other words, it's permissible to add such nodes to the
^APTH lines of messages during the import cycle, in which case messages with
^APTH lines containing the added nodes would be exported to all nodes.
However, it's also permissible to add those nodes to the ^APTH line during the
export cycle, including them only in the ^APTH lines of the nodes that need to
see them. Please keep in mind that such nodes are added only to the END of the
^APTH line, AFTER the address of the system processing the message. In any
event, it's up to the software author to implement this feature in such a
manner that duplicates will not be created.
Similarly, if a node RECEIVES a message containing a ^APTH line that lists
nodes with trailing non-numeric characters, it is permissible to remove those
nodes from the ^APTH line if it can be positively ascertained that they are no
longer required. Generally speaking, this should NOT be done unless there is
absolutely NO possibility of the message being exported to one of the nodes in
question. Note that in most situations, if a ^APTH line contains a node with a
trailing non-numeric character, but it is followed by a node number (other than
your own) that does NOT have a trailing non-numeric character (that is, the
node with the trailing non-numeric character is not one of the last nodes on
the line), then it can usually be safely removed since it will have already
"passed through" the fully-coupled topology.
Using the previous example of 157/200, 154/9, and 154/970 participating in a
fully-coupled topology, the ^APTH line as received at 154/9 and 154/970 might
end as follows:
..... 157/200 154/9! 970!
However, please note that if 157/200 also feeds other nodes that are NOT part
of this particular fully coupled topology, there is no real reason they would
have to see the "154/9! 970!" at the end of the line. However, there is no
prohibition against including those nodes in the ^APTH lines of messages
exported to other nodes.
Once this example message arrives at 154/9, the ^APTH line would be changed to
look like this:
..... 157/200 154/970! 9
Now, when this message is exported from 154/9 to another node (154/111 for
example), that node may remove the "154/970!" as long as 154/9 remains in the
^APTH line, since as long as the message cannot be sent back to 154/9, it
cannot re-enter the fully-coupled topology. The ^APTH line at this point
(after the message is received on 154/111) might look like this:
..... 157/200 154/9 111
It would probably not be advisable to remove the "154/970!" at 154/9 in this
example, even if the message has already been exported, because the message
might need to be re-exported (such as when a new board picks up an echo feed).
When in doubt, nodes with trailing non-numeric characters (other than your own)
should be left in the ^APTH line. While there is a cost of a few extra bytes
per message if you leave them in, it does not compare to the cost of the
duplicate messages that could be generated if they are removed
indiscriminately.
Jack Decker
April 1, 1990